In this investigation, I wanted to look at the characteristics of trip duration that could be used to predict start time. The main focus was on the start date and the gender of users.
This data set includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.graph_objects as go
import plotly.express as px
%matplotlib inline
# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")
# load in the dataset into a pandas dataframe
tripdata = pd.read_csv('tripdata.csv')
# Convert start_time and end_time to datetime for analysis
tripdata['start_time'] = pd.to_datetime(tripdata.start_time, format='%Y-%m-%d %H:%M:%S')
tripdata['end_time'] = pd.to_datetime(tripdata.end_time, format='%Y-%m-%d %H:%M:%S')
# Extract the calendar date as a string, used for per-day counts in February
tripdata['start_time_date'] = tripdata['start_time'].dt.to_period('D').astype(str)
# Extract the day-of-week name (Monday ... Sunday) for day-of-week counts
# and insert it into the tripdata dataframe
tripdata['start_time_day'] = tripdata['start_time'].dt.day_name()
Trip lengths in the dataset take on a very large range of values, from about 0 seconds at the lowest to about 82,000 seconds at the highest. Plotted on a logarithmic scale, the distribution of trip durations takes on a right-skewed shape.
The duration variable took on a large range of values, so I looked at the data using a log transform.
# There's a long tail in the distribution, so let's put it on a log scale instead
fig = px.histogram(
data_frame=tripdata,
# Set up the x-axis
x="duration_sec",
title='Trip Length Distribution (log_x=True)',
# Logarithmic axes with Plotly Express
log_x=True)
# Show the plot
fig.show()
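As a rough illustration of why a log scale helps with a long-tailed variable, the sketch below builds histogram bins that are evenly spaced in log10 space. The data is synthetic (a lognormal sample standing in for duration_sec), and the choice of four bins per decade is an assumption made only for this example.

```python
import numpy as np

# Synthetic right-skewed durations in seconds (hypothetical stand-in for duration_sec)
rng = np.random.default_rng(0)
durations = rng.lognormal(mean=6.3, sigma=0.8, size=10_000)

# Round the observed range outward to whole powers of ten
lo = int(np.floor(np.log10(durations.min())))
hi = int(np.ceil(np.log10(durations.max())))

# Bin edges evenly spaced in log10 space, four bins per decade
log_edges = np.logspace(lo, hi, num=4 * (hi - lo) + 1)
counts, edges = np.histogram(durations, bins=log_edges)

# Every bin spans the same multiplicative factor, not the same number of seconds
ratios = edges[1:] / edges[:-1]
print(counts)
```

On a linear axis most of these observations would crowd into the first few bins; with log-spaced edges each bin covers the same ratio, so the bulk of the distribution and its tail are both visible.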
Next, we take a look at how the day of the week affects the number of trips made in a day. The day-of-week and date columns had to be converted to strings so that I could generate the histogram.
# Investigating further by day
fig = px.histogram(
data_frame=tripdata,
x="start_time_day",
title='Trip Start Time (by Day of Week)')
# Fixing the x-axis order
fig.update_xaxes(categoryorder='array', categoryarray= ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
fig.show()
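The same day-of-week counts can be checked numerically without a plot. This is a minimal sketch on a handful of made-up start times (not rows from tripdata); it counts trips per day name and then reindexes so the result follows calendar order, with zero for days that have no trips.

```python
import pandas as pd

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

# Hypothetical start times, for illustration only
starts = pd.Series(pd.to_datetime([
    '2019-02-01 08:00:00',  # Friday
    '2019-02-02 09:30:00',  # Saturday
    '2019-02-04 17:15:00',  # Monday
    '2019-02-08 07:45:00',  # Friday
]))

# Count trips per day name, then force Monday-to-Sunday order
trips_per_day = (
    starts.dt.day_name()
    .value_counts()
    .reindex(day_order, fill_value=0)
)
print(trips_per_day)
```

Reindexing plays the same role as categoryarray in the Plotly call: without it, the days would appear in order of frequency or first occurrence rather than calendar order.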
There is a strong positive relationship between end_station_latitude and start_station_latitude, and likewise between end_station_longitude and start_station_longitude. On the other hand, we can also see a strong negative relationship between longitude and latitude: start_station_longitude with start_station_latitude, end_station_longitude with start_station_latitude, and end_station_longitude with end_station_latitude.
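# Restrict to numeric columns; recent pandas versions raise an error if text columns are passed to corr()
cr = tripdata.select_dtypes(include='number').corr(method='pearson')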
fig = go.Figure(go.Heatmap(
x=cr.columns,
y=cr.columns,
z=cr.values.tolist(),
colorscale='rdylgn', zmin=-1, zmax=1))
fig.update_layout(
title="Correlation Heatmap Across All The Columns",
)
fig.show()
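To make the heatmap values concrete, here is a small sketch of the Pearson coefficient computed by hand and compared with NumPy. The coordinates are made-up points roughly in the Bay Area, not values from tripdata, and pearson_r is a helper name introduced only for this example.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Hypothetical station coordinates, for illustration only
start_lat = np.array([37.77, 37.78, 37.80, 37.33, 37.87])
start_lon = np.array([-122.42, -122.40, -122.27, -121.89, -122.26])
# Trips mostly end close to where they start, so end latitude tracks start latitude
end_lat = start_lat + np.array([0.010, -0.005, 0.002, 0.004, -0.003])

r_lat_lat = pearson_r(start_lat, end_lat)    # close to +1
r_lat_lon = pearson_r(start_lat, start_lon)  # negative
print(r_lat_lat, r_lat_lon)
```

In this toy data the latitude/longitude coefficient is negative because the more northerly points are also the more westerly ones; whether that is the full explanation for the real dataset is an assumption, not something the heatmap alone confirms.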
We can see that there are no significant differences in these plots. It is possible that users are a bit older on Tuesdays and Fridays than usual, as the lower edge of the birth-year IQR extends further down than on the other days of the week, and an earlier birth year means an older rider.
# Generating the box plots for each variable.
tripdata_samp = tripdata.sample(n=3000, replace = False)
def boxgrid(x, y, **kwargs):
    # Draw one box plot per grid cell in a single color, hiding outliers
    default_color = sb.color_palette()[0]
    sb.boxplot(x=x, y=y, color=default_color, showfliers = False)
# PairGrid creates its own figure, so a separate plt.figure() call would only leave an empty figure
g = sb.PairGrid(data = tripdata_samp, y_vars = ['duration_sec', 'member_birth_year'], x_vars = ['member_gender', 'start_time_day'],
                height = 4, aspect = 1.6)
g.map(boxgrid)
plt.title('Box Plot On Days, Duration, and Birth Year', x=-0.1, y=2.2)
plt.show();
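The quartiles that the box plots draw can also be read off as numbers, which makes small differences between days easier to judge. The sketch below uses a tiny made-up table (not a sample of tripdata) and computes the first quartile, median, third quartile, and IQR of birth year per day.

```python
import pandas as pd

# Hypothetical sample, for illustration only
sample = pd.DataFrame({
    'start_time_day': ['Monday'] * 4 + ['Tuesday'] * 4,
    'member_birth_year': [1980, 1985, 1990, 1995, 1970, 1980, 1990, 1995],
})

# Quartiles per day: these are the box edges and the center line of a box plot
quartiles = (
    sample.groupby('start_time_day')['member_birth_year']
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ['q1', 'median', 'q3']
quartiles['iqr'] = quartiles['q3'] - quartiles['q1']
print(quartiles)
```

A lower q1 for birth year means a larger share of riders born earlier, that is, older riders. Because the plots above are drawn from a random sample of 3,000 rows, small shifts in these quartiles may change from run to run.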
Birth year had a surprisingly high correlation with the duration of the ride. An approximately exponential relationship was observed when duration was plotted. Box plots tell us that there aren't huge differences across user gender or day of the week. There was also an interesting relationship observed between start_time and end_time.
plt.scatter(data = tripdata, x = 'member_birth_year', y = 'duration_sec');
plt.xlabel('Birth Year')
plt.ylabel('Duration')
plt.title('Correlation on Birth Year and Duration')
I extended my investigation of start time against duration in this section by looking at the impact of the categorical features. The multivariate exploration here showed that there are more observations for younger riders (later birth years), but in the second plot it is hard to see any relationship.
Looking at the point plots, it doesn't appear that the categorical features have a systematic interaction effect. The features, on the other hand, aren't completely independent of one another.
plt.hist2d(data = tripdata, x = 'start_time', y = 'member_birth_year', weights = 'duration_sec',
cmap = 'viridis_r');
plt.xlabel('Start Time')
plt.ylabel('Birth Year');
plt.colorbar(label = 'Duration');
plt.title('Correlation Between Duration, Age and Startdate', y=1.08)
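One point worth keeping in mind when reading a weighted 2D histogram: the weights argument sums the weight in each cell, so a dark cell can mean many trips rather than long trips. A minimal sketch, assuming synthetic data and a numeric start hour as a stand-in for start_time, shows how to recover the mean duration per cell by dividing the weighted histogram by the plain counts.

```python
import numpy as np

# Synthetic data, for illustration only (hypothetical stand-ins for the real columns)
rng = np.random.default_rng(1)
n = 5_000
start_hour = rng.uniform(0, 24, n)
birth_year = rng.integers(1950, 2001, n).astype(float)
duration = rng.lognormal(mean=6.3, sigma=0.8, size=n)

# Weighted histogram: each cell holds the SUM of durations that fall in it
total, xedges, yedges = np.histogram2d(
    start_hour, birth_year, bins=[12, 10], weights=duration)

# Unweighted histogram on the same edges: each cell holds the trip COUNT
count, _, _ = np.histogram2d(start_hour, birth_year, bins=[xedges, yedges])

# Mean duration per cell; cells without trips stay NaN
mean_duration = np.divide(
    total, count, out=np.full_like(total, np.nan), where=count > 0)
print(mean_duration.shape)
```

Plotting mean_duration (for example with plt.pcolormesh) instead of the summed weights would separate "where trips are long" from "where trips are frequent".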
Next, let's see how the day of the week relates to duration and start time, restricting the data to trips of approximately one hour.
# select duration of approximately 1 hour
tripdata_hour = (tripdata['duration_sec'] >= 3550) & (tripdata['duration_sec'] <= 3650)
tripdata_one = tripdata.loc[tripdata_hour,:]
fig = plt.figure(figsize = [8,6])
ax = sb.pointplot(data = tripdata_one, x = 'start_time', y = 'duration_sec', hue = 'start_time_day',
palette = 'Blues', linestyles = '', dodge = 0.4)
plt.ylabel('duration_sec')
plt.title('Correlation Between Duration, Startdate and Day of The Week')
plt.show();
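The one-hour filter above combines two comparisons into a boolean mask. As a small sketch on made-up durations, the same selection can be written with Series.between, which includes both endpoints by default.

```python
import pandas as pd

# Hypothetical durations in seconds, for illustration only
trips = pd.DataFrame({'duration_sec': [300, 3549, 3550, 3600, 3650, 3651, 82000]})

# Explicit mask, as in the cell above
mask = (trips['duration_sec'] >= 3550) & (trips['duration_sec'] <= 3650)

# Equivalent selection with between(); both bounds are inclusive
about_an_hour = trips.loc[trips['duration_sec'].between(3550, 3650)]
print(about_an_hour)
```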
!jupyter nbconvert Part_II_slide_deck_template.ipynb --to slides --post serve --no-input --no-prompt